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Abstract 

Background: Patient-specific aberrant expression patterns in conjunction with functional screening assays can 
guide elucidation of the cancer genome architecture and identification of therapeutic targets. Since most statistical 
methods for expression analysis are focused on differences between experimental groups, the performance of 
approaches for patient-specific expression analyses are currently less well characterized. A comparison of methods 
for the identification of genes that are dysregulated relative to a single sample in a given set of experimental 
samples, to our knowledge, has not been performed. 

Methods: We systematically evaluated several methods including variations on the nearest neighbor based 
outlying degree method, as well as the Zscore and a robust variant for their suitability to detect patient-specific 
events. The methods were assessed using both simulations and expression data from a cohort of pediatric acute B 
lymphoblastic leukemia patients. 

Results: We first assessed power and false discovery rates using simulations and found that even under optimal 
conditions, high effect sizes (>4 unit differences) were necessary to have acceptable power for any method (>0.9) 
though high false discovery rates (>0.1) were pervasive across simulation conditions. Next we introduced a 
technical factor into the simulation and found that performance was reduced for all methods and that using 
weights with the outlying degree could provide performance gains depending on the number of samples and 
genes affected by the technical factor. In our use case that highlights the integration of functional assays and 
aberrant expression in a patient cohort (the identification of gene dysregulation events associated with the targets 
from a siRNA screen), we demonstrated that both the outlying degree and the Zscore can successfully identify 
genes dysregulated in one patient sample. However, only the outlying degree can identify genes dysregulated 
across several patient samples. 

Conclusion: Our results show that outlying degree methods may be a useful alternative to the Zscore or Rscore in 
a personalized medicine context especially in small to medium sized (between 10 and 50 samples) expression 
datasets with moderate to high sample-to-sample variability. From these results we provide guidelines for detection 
of aberrant expression in a precision medicine context. 
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Background 

The use of functional assays such as the interrogation of 
patient-derived cancer cells against panels of small inter- 
fering RNA (siRNA) duplexes or small molecule inhibitors 
allows patients who are part of the same disease subgroup 
to be further stratified based on an assessment of the 
effect of gene down-regulation on cancer cell viability 
[1,2]. The advent of precision medicine represents a 
methodological paradigm shift from traditional detection 
of differences between experimental groups towards identi- 
fication of individual events or outliers (for example, indi- 
vidual expression patterns and patient-specific siRNA/drug 
sensitivities). Although some work has been done charac- 
terizing patient-specific dysregulation of pathways [3-6], 
univariate patient-specific analysis of gene expression has 
not been thoroughly explored. 

Arguably the most common type of analysis procedure 
applied to mRNA expression experiments is the determin- 
ation of putative differential expression [7-9]. However, 
even within specific subgroups of patients with cancer, the 
same genes are not always dysregulated in the same 
manner in every specimen. Individual expression patterns 
can reflect underlying mutation, chromosomal rearrange- 
ment and copy number events. This shifts the focus to 
a different type of analysis procedure: identification of 
a single sample or small subgroups that have divergent 
expression from the rest of the group (for example, the 
detection of candidate oncogenic chromosomal aberrations 
on the basis of outlier gene expression in prostate cancer 
[10]). Many procedures have been devised to detect the 
latter situation with earliest efforts, cancer outlier profile 
analysis (COPA) [10] and the outlier sum (OS) [11], fo- 
cused on prioritization after a robust standardization 
procedure. Others have extended this to robust t or F 
tests [12-16] or similar procedures [17-20]. Additionally, 
the problem has also been viewed as one of population 
or proportional differences between two groups [21-23]. 
Recently, the anti-profile method was developed to look 
for genes with high variability across samples and used to 
discriminate colon cancer cases from controls [24]. A 
limitation of these procedures is that they assume both a 
control as well as an experimental group though several, 
including OS, COPA and the very recently described 
mCOPA [25], will work with only one group. Others 
have focused on the observation that, in the presence 
of outlying subgroups of patients for a given gene, the 
distribution would become bi- or multimodal [26-28]. 
Effective parameter estimation for such mixture models 
would require substantial sample sizes thereby limiting 
these approaches to large, well-defined cohorts. Addition- 
ally, general methods originally devised in other fields such 
as the outlying degree (OD) [29,30] or the gene tissue index 
[31] can be used in a gene-wise univariate context for 
finding outlying subgroups. However all of these methods. 



with the exception of the OD method, provide a ranking 
of genes for a given cohort, not for a specific sample 
within the cohort. Searching for outliers or 'hits' for a given 
sample is a common procedure for some types of experi- 
ments, such as genome-wide siRNA screens. Two proce- 
dures used for these experiments are a Z-transformation 
(Zscore) or robust Zscore (Rscore) along with a cutoff 
dictating outlier or hit status [32]. Both methods have been 
applied to microarray analysis as well. For instance, the 
Zscore approach was first applied to microarray datasets 
a decade ago [33] and still is used for sample-specific 
analyses as implemented in the cBio web portal [34]. 
We also note that the Rscore is the first step of the 
COPA and OS methods with outlier status determined 
empirically [10,11]. 

Complicating the search for outlying subgroups is the 
fact that microarrays as well as other high throughput 
assays can be sensitive to many technical factors [35-39]. 
In addition, expression differences between samples can be 
caused by many potentially confounding factors regarding 
clinical samples such as gender, ethnicity and age as well as 
differences in tissue or cell preparation. A concrete example 
of such an effect leading to expression differences among 
two subgroups is the non-coding RNA XIST, which is 
highly expressed in females but has almost negligible levels 
in males [40]. Although effective methods exist to correct 
both known [41] and unknown [42] factors, it may still be 
important to consider overall sample dissimilarities when 
the expression values of single genes are compared between 
samples and/or groups. This will become an even more 
important issue as we move towards a precision medicine 
clinical paradigm where it is likely that sample processing 
would immediately follow acquisition rather than forming 
balanced batches (in terms of relevant covariates) that can 
be randomized. 

In this paper, we consider the question of how to detect 
genes that exhibit aberrant expression for a subset of 
patients focusing on the situation where the subset 
contains only a single patient sample. We perform simula- 
tions testing the effectiveness of multiple approaches 
including the widely used Zscore and Rscore as well as 
weighted and unweighted variants of the OD method. 
We first evaluate these approaches, simulating a wide 
variety of conditions, and show that the OD methods 
have advantages over the other two methods in terms of 
performance. In addition to the simulations, we examine 
gene ranking results across methods for exon array 
leukemia expression data in the context of corresponding 
functional assay results (siRNA hits performed on samples 
from the same patients). We show that the OD methods 
provide more flexibility than the Zscore and Rscore and 
further show that the OD method performs similarly or 
better than the Zscore for two analytical use cases relating 
the expression data to the siRNA results. 
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Methods 

Simulations 

Data were initially simulated from either a normal distribu- 
tion with mean of seven and standard deviation of one 
or a t-distribution with 15 degrees of freedom and a 
non-centrality parameter set to seven. Each data set 
generated contains 10,000 genes and 20 samples. These 
distributions and parameters were chosen as they had 
similar ranges as those from robust multi-array average 
(RMA) summarized Affymetrix arrays as well as represent- 
ing the extremes of sample-sample expression variability. 
These distributions are depicted in Figure lA along with a 
hypothetical distribution which would likely mirror that of 
expression values from a given patient cohort. For a given 
sample and set of genes, a specified value (three, four or 
five) was added to introduce the true outlier (s). Similarly, 
negative two was added to the specified set of genes and 
samples that were simulated as overall sample technical 
outliers. The negative value here was chosen because tech- 
nical outliers more frequently result in decreased intensity 
values for Affymetrix arrays in our experience. The P-value 
and false discovery rate (FDR) for each statistic was 
calculated similar to previous work [11,43] as: 



Pvalucj 



«-i 



FDRi 



(1) 



(2) 



Where /() is the indicator function, S is the number of 
true positive genes (100 in all the simulations), T is a 
vector of length n containing the absolute value of the 
statistics from a given method, and O is an ordered 
version of T such that its elements are decreasing. The 
set Gj is limited to a single integer in (1) whereas G is 
a set of size S in (2) representing the position in T or O 
respectively with the true outlier gene(s) for a given 
sample Note that the FDR in this case assumes 100 
true positives, which is meant to simulate an activated 
pathway. The sample assessed for these statistics was 
always the one with simulated outlier gene expression. 
Each different combination of parameters was run 
10,000 times. Power was computed as the proportion 
of the 10,000 iterations which were significant at the .05 
level. The FDR was reported as the average FDR observed 
over the 10,000 iterations for each simulation. Simulations 
and calculation of the P-values and FDR were carried 
out in parallel on a Beowulf-style cluster using Rmpi 
[44] with parallel random number generation using 
UEcuyer s method [45] via the rlecuyer package [46] using 
R-2.15.1 [47]. Plotting and summaries for the simulations 
were performed using ggplot2 [48] on R-3.0.1 [47]. Example 
code for performing these simulations is provided in 
Additional file 1. 



RNA sample preparation and array processing 

All research was performed according to the guidelines of 
the Helsinki Declaration. Specifically, both oral and written 
informed consent was obtained from the patient or parent/ 
legal guardian for inclusion in the study. Assent was also 



B 






Expression Values 
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Figure 1 The outlying degree outperforms other methods in both high and low variability simulated datasets. (A) Expression data was 
simulated from two distributions (normal with mean of seven and standard deviation of one as well as a t-distribution with non-centrality parameter 
set to seven and the degrees of freedom equal to fifteen) that were at the extremes of what would be typically observed in microarray data with the 
distribution of hypothetical patient data situated somewhere in the middle. (B, C) The outlying degree {k = 9) significantly outperformed both the 
Zscore and Rscore method in terms of power and false discovery for all combinations of effect size and distribution type. However, all the methods 
were only effective when encountering high effect sizes (four to five) with low variability (normal distribution). The grey areas indicate 0.95 confidence 
intervals. Note that for the false discovery rate, the estimates were very stable and the grey area is not readily observable. OD, outlying degree. 
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obtained for patients between ages 7 and 17 years. The 
study was reviewed and approved by the Institutional 
Review Board of Oregon Health & Science University. 

Mononuclear cells from blood or bone marrow of 
patients with hematologic malignancies were isolated 
using a FicoU gradient. Cell pellets were snap-frozen in 
liquid nitrogen and cryopreserved at -80°C for subsequent 
batch extraction of RNA. RNA was extracted using 
Qiagen RNeasy kits (Qiagen, Valencia, CA) according 
to the manufacturer instructions, including perform- 
ance of the optional on-column DNase treatment step. 
Samples were amplified and labeled using the Ambion 
WT Expression/ Affymetrix WT Terminal Labeling proto- 
col (Afiymetrix, Santa Clara, CA). Amplified and labeled 
cDNA target samples were each hybridized to a Human 
Exon 1.0 ST array (Affymetrix, Santa Clara, CA). Image 
processing was performed using Affymetrix Command 
Console (AGCC) v.3.1.1 software. The expression data 
have been deposited in the Gene Expression Omnibus 
database under the identifier [GEO:GSE42731]. Twelve 
samples had acceptable RNA integrity number scores 
(>8) and similar overall intensity distributions and 
were analyzed further. These samples were processed 
using the oligo [49] Bioconductor package [50]. Back- 
ground correction and normalization performed via the 
RMA method [51] using the core metaprobesets as well as 
probesets. Note that probeset-level summarizations were 
used for the visualizations whereas metaprobeset-level 
summarizations were used for the methods comparisons. 
Here we consider metaprobesets to denote single tran- 
scripts or genes. Ensembl annotations and coordinates 
were retrieved from Ensembl Build 69 [52] and used in 
conjunction with the GenomicFeatures package [53] to 
provide gene contexts for the (meta)probesets. All of the 
appUed analysis was performed using R-3.0.1 [47] with 
plotting again performed using ggplot2 [48] as well as 
GenomeGraphs [54]. The reshape2 [55] and biomaRt [56] 
packages were also utilized. 



standardization (Rscore) of the X matrix as described 
previously [10]: 



Rscore, 



I) 



Xij-median{xi) 
mad{xi) 



(4) 



where the mad function is the median absolute deviation. 

The OD method [29,30], is a simple, distance-based 
method for assigning a score to a given sample indicating 
the degree with which it differs from the k nearest samples 
for a given gene. For both the OD method and weighted 
variants of the outlying degree (WOD), we first define 
an expression distance matrix D of dimension n x m-1 
containing the absolute values of the pairwise expression 
differences between the current sample ; and the remaining 
samples in the cohort indexed by for gene /. That is, each 
element of the D matrix denoted as dwi is defined as: 



da 



forj'T^j 



(5) 



For each gene, the absolute expression differences dij'j 
are sorted in increasing order giving the corresponding f 
values provided in O/,, which is indexed by / taking values 
from 1 to The outlying degree value for each sample 
and each gene, ODip is the sum of the first k elements of 
the dij vector ordered by the o^y vector as is shown in (6) 
and (7). 

Oij = f 1 m-i such that dtp < ... < di^ (6) 

k 

1=1 



(7) 



The WOD methods are similar to the OD method but 
rely on a weight matrix describing the sample-to-sample 
dissimilarity, which is the mx m matrix W and is defined 
by the Euclidean distances between samples as in (8). 



En I 
i=i rir^ij 



(8) 



Expression prioritization approaches 

Let Xij be the expression or simulated expression data 
at gene /=l,...,w and sample ; = l,...,m. A commonly 
used way to screen for outliers is the Zscore, which is 
the X matrix mean centered and scaled by the standard 
deviation {sd): 



Zscoreu 



Xij-mean{xi) 
sd{xi) 



(3) 



where Xi indicates application of the function to the 
entire vector of sample expression values for gene /. 
Similar to the Zscore method, we define a robust 



The simplest variation of the OD is WODa, which 
weights each of the k nearest distances by the scaled 
Euclidean distance between the two samples in question. 
That is, the weights are applied after determining the 
nearest absolute expression differences. 



WODau 



(9) 



By contrast, the WODb approach first applies the 
scaled weights, computes the nearest absolute expression 
differences and then finds the sum of the k nearest 
weighted differences. One difference between (9) and (11) 
is that the value used to scale the weights is based on 
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the sum of the weights associated with the k nearest 
differences in (9) and the sum of the non-diagonal weights 
in (11). 

= f 1.../ .-1 such that < ... < (10) 

WODb, = ^'-' ' '^'^ ^ (11) 

;■' 

For all the OD methods, k was set to nine or six for the 
simulated and real data respectively, based on the simula- 
tions in Figures S3 and S4 in Additional file 2. An imple- 
mentation of these methods is provided in Additional file 1 
and will be provided as part of an R package pod' at [57]. 

Results and discussion 

Methods and parameters 

The Zscore as defined (see Methods) is a simple approach 
to assessing whether an outlier exists in a moderately sized 
dataset [58]. However, its use of the difference from the 
mean as the numerator (as well as the standard deviation 
in the denominator) means that it potentially could be 
influenced by outliers itself This is a well-known property 
of related procedures based on means and many alter- 
natives exist to reduce the influence of outliers, such as 
the use of trimmed means or medians. The median-based 
robust analogue of the Zscore utilizes the difference from 
the median divided by the median absolute deviation 
(Rscore) as has been suggested in some of the initial 
work in looking for genomic outliers [10]. The OD, as 
implemented (see Methods), is a measure of how different 
the expression value for a given sample is from the expres- 
sion values from the k nearest samples for a given gene. 
The choice of the k parameter in this respect is important 
as it may impact sensitivity and specificity. The k parameter 
can take integer values between 1 and m -1 (assuming 
m >1) with the case of k = 1 equivalent to the absolute 
difference between the given sample and the most similar 
of the remaining samples for a given gene. For the case of 
finding genes containing single sample outliers, we carried 
out several simulations examining both power and FDR 
for a wide range of k values. For our simulation size of 20 
samples, we found that k=9 seemed to provide good 
performance over a range of effect sizes (Figures SI 
and S2 in Additional file 2) with relatively little additional 
performance gains above nine. In general a k value set to a 
value near m/2 seemed to provide adequate performance 
for cohort sizes >10 (Figures S3 and S4 in Additional 
file 2). Note that this assumes that the conditions of 
the simulation roughly approximate that of the dataset 
in question and that one is mainly interested in finding 
single sample outliers. This is likely to be the case for 



the simulations as they were carried out using similar 
parameters. Utilizing a different k value may influence 
power and FDR estimates for a given simulation, though 
from these simulations it appears that decreases in per- 
formance would mainly occur when utilizing a substantially 
lower k value. Although performance is difficult to assess 
using experimental data, we argue that for detection of 
single sample outliers it is similar enough due to the RMA 
preprocessing, which makes the overall expression distri- 
butions more comparable to each other as well as having 
a range of expression values similar to the simulations. 

Evaluation using simulations 

Several aspects of the OD method could be improved based 
on an examination of actual array experiments. First, 
overall dissimilarities between samples could inappropri- 
ately increase the score for a given gene, making it desirable 
to down-weight sample-sample differences based on a 
measure of overall dissimilarity. An example of this would 
be an array that had a subset of genes with dissimilar 
hybridization characteristics but not to the extent that it 
would be removed for quality control purposes. Also, 
this would be important in a precision medicine context 
as we would expect samples to vary in similarity based 
on technical and biological factors. A straightforward 
adaptation of the OD method would be to incorporate 
weights that would decrease the influence on sample- 
sample comparisons for a given gene if the samples 
themselves were highly dissimilar. Based on previous work 
in the field of spatial statistics [59], we implemented 
several variants of the weighted OD, the only difference 
being whether the weighting was taken into account 
before (WODb) or after (WODa) the nearest neighbors 
were computed. 

We first compared all methods using a straightforward 
power simulation where a single gene had a single sample 
outlier with a true effect size ranging from three to 
five units, and where data were either generated from a 
re-centered normal or t-distribution to capture the range 
of sample-sample variability to which actual samples 
might belong (Figure lA). Weighting the OD method 
based on overall sample dissimilarity in this context had 
no benefit over the basic OD approach as all samples 
would be overall very similar as a product of the simulation 
approach. However, the OD methods had significantly 
higher power than either the Zscore or Rscore in all six 
simulations (Figure IB). Even for the normal distribution 
simulation, large effect sizes of four or five were necessary 
to reach high power (>0.9) for all methods whereas only 
the OD method achieved adequate power (>0.8) at the 
lowest evaluated effect size. For the t-distribution, no 
method was able to achieve adequate power even at the 
highest effect size. An analogous simulation addressing 
the FDR was also carried out, which demonstrated that 
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the OD method overall had lower FDR values (all six were 
significantly lower than Zscore or Rscore; Figure IC). For 
both distribution types, the FDR was high especially for an 
effect size of three. The OD method was the only one to 
achieve acceptable FDR (<0.10) at an effect size of five for 
the normal distribution. Together, this indicates that 
reasonable power should be achievable for experimental 
samples at the expense of higher false discovery for effect 
sizes greater than four. As discussed below, effect sizes of 
this magnitude are observable in expression datasets. 

A more realistic case involves one or more samples 
containing genes with lower expression on average than 
the remainder of the cohort. This is often seen due to 
technical issues that affect the overall hybridization 
characteristics of a given array. We simulated a rather 
extreme situation where 2,500 or 7,500 genes in one or 
three samples were affected by such a technical issue 
and therefore were two units lower on average than 
the remainder of the samples (Figure 2A). In each 
case, we considered the situation where the sample 
and gene(s) with the true outlier effect were among 
those impacted by the technical factor. Otherwise data 
were simulated from the normal distribution with an 
effect size of five. Overall, the OD methods had signifi- 
cantly higher power and lower FDR values in all four simu- 
lations (Figure 2B,C). Differences between the three OD 
variants were observed when there were three affected 
samples with the WODb variant having additional per- 
formance gains over the other two methods. In all cases, 
performance was seriously hampered by the introduction 



of the technical factor, meaning that that these proce- 
dures will only perform adequately if all samples are 
overall similar. 

Evaluations using experimental data 

We next applied all five methods to an experimental 
dataset consisting of samples from 12 pediatric patients 
with acute B lymphoblastic leukemia run on Affymetrix 
Exon arrays (see Methods). For the OD methods we set 
k to six. We first determined the number of genes that 
roughly fell into our simulation effect size categories of 
three, four and five. This was done by computing the 
difference between the sample with the highest gene 
expression value and the sample with the second highest 
gene expression for a given gene (note that this is 
equivalent to using the OD method with /c = 1 for the 
sample with the highest gene expression). We refer to 
this value as the delta and it assumes that there is a 
single sample that is up-regulated relative to the rest for 
a given gene, which is the case in the simulations. We 
found that there were 3, 14 and 55 genes respectively in 
each of the effect size categories. As the delta is computed 
per gene and does not convey sample level information 
we determined the ranks for the patient sample with the 
highest expression value. Focusing on the genes with delta 
values of four or greater, the OD methods performed 
similarly and all ranks were within the top 10 for the 
given patient sample. In 9 out of 14 cases (64%), the 
OD method ranked equal to or higher than the Zscore 
method, and in 10 out of 14 (71%) when compared to 




1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 1 3 1 3 1 3 1 3 

Samples Number of Divergent Samples Number of Divergent Samples 



Figure 2 The weighted outlying degree can attenuate the effect of sample-specific technical variability. (A) An example of a simulated 
dataset from the normal distribution with a technical factor affecting 2,500 of the 10,000 genes of sample one, making it divergent. The size of 
the effect is a two-unit decrease. (B, C) display power and false discovery rate estimates for the methods based on similar simulations to (A), 
where either 2,500 or 7,500 genes of one or three samples were affected. The effect size was kept at five units. The WODb method outperforms 
the others at least for the case where the number of divergent samples was equal to three. The grey areas indicate 0.95 confidence intervals. 
Note that for the false discovery rate, the estimates were very stable and the grey area is not readily observable. OD, outlying degree method; 
WODa, weighted outlying degree with weighting performed after nearest neighbor computations; WODb, weighted outlying degree with 
weighting performed before nearest neighbor computations. 
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the Rscore method. Because the Rscore had relatively 
poor performance in the simulations, and the weighted 
variants of the OD method are most useful in cases of 
large technical differences for multiple samples, we then 
focused on the comparison between the OD method and 
the Zscore. To quantif)^ the differences between the two 
methods, we examined the top 20 genes for patient 
sample 09206 from the Zscore and OD method and 
found that, in general, the Zscore method ranked higher 
those genes with low sample-sample variability outside of 
a single outlier whereas the OD method {k = 6) tolerated 
greater variability. We quantified this by computing the 
standard deviation after removing the highest expression 
value for the top 20 genes from both methods and observed 
that the median value of this standard deviation from the 
OD method was 0.411 (range: 0.144 to 1.784) whereas for 
the Zscore it was 0.174 (range: 0.051 to 0.384; Figure 3 and 
Table SI in Additional file 3). As shown in Figure 3, the top 
ranked genes for the OD and Zscore methods, PTPRM and 
TDRD9, exhibited clear gene-level over-expression. We 
note that knockdown of PTPRM has been previously 
suggested to decrease cell growth and survival in glioblast- 
oma multiforme [60], suggesting its possible inclusion 
in a future iteration of the siRNA panel. Less seems to 
be known about TDRD9, It should be noted that the k 
parameter provides a mechanism through which the 
user can control the type of events that are prioritized 
for a given sample. For example, increasing k allows 
more sample-sample variability and therefore the rankings 
will be more divergent from the Zscore, decreasing k 
will do the opposite (Figure S5 in Additional file 2). The 
user can choose k based on his/her hypothesis regarding 



the sample-sample differences, keeping in mind its effect 
on power and false discovery as discussed above. 

As an initial applied analysis, we examined the results 
of the OD and Zscore in the context of the patient sample 
T119, which had an siRNA hit for RORl (Table S2 in 
Additional file 4). We chose patient sample T119 as it 
had only a single siRNA hit and therefore we could 
expect some dysregulated genes that were unique to the 
sample, demonstrating the arguably most common use 
case for the Zscore. Overexpression of RORl in acute 
lymphoblastic leukemia samples with the t(l;19) trans- 
location has been previously characterized [61] and it 
was hypothesized that the resulting ftision of the genes E2A 
{TCF3) and PBXl halt the development of the progenitor B 
cells and continue the expression of RORl along with 
the preBCR complex. RORl and the preBCR complex 
contribute to proliferation and survival through the PI3K, 
AKT and MEK/ERK pathways. Examining the expression 
of E2A and PBXl in our dataset, we found that E2A was 
highly expressed across all samples while PBXl was highly 
expressed in sample T119 with moderate or low expression 
in the other samples. As a result, PBXl was ranked first 
and second for the Zscore and OD methods respectively 
for sample T119. It has also been previously suggested 
by their joint down- regulation following siRNA treatment 
of the fusion product [62] that EB-1 (ANKSIB) and one iso- 
form of WNT16 were also up-regulated as a consequence 
of the E2A-PBX1 ftision. The ANKSIB gene was ranked 
first by the OD method and ninth by the Zscore method, 
while the WNT16 gene was ranked 11^^ by both methods. 
RORl itself was ranked 28th and 30th for the Zscore and 
OD methods respectively. This demonstrates that both 




Figure 3 The outlying degree is more robust to variability across samples than the Zscore in experimental data. (A) The top five genes 
for both the Zscore and outlying degree method were found for sample 09206. From comparison purposes we plotted the distribution of the 
expression levels of the 12 patient samples for the top five ranked genes in either method. It was clear that the Zscore ranked higher those 
genes where a single outlier was found with the remainder of the samples tightly grouped together whereas the outlying degree {k = 6) ranked 
higher those genes with large differences while tolerating more expression variability between the samples. Exon-level summary of the genes 
ranked the highest in (A) (Rank 1) are shown for both (B) the Zscore and (C) the outlying degree methods. 

V . J 
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the OD and Zscore methods are effective at pulling out 
potential gene expression signatures related to a specific 
patients disease characteristics when divergent from the 
rest of the cohort. 

Another use case is the identification of commonalities 
among patient samples that share one or more gene targets. 
In our dataset, the siRNA hit for samples 09206 and 08419 
was shared {TNKl) so a natural question is whether 
there are genes that have shared expression dysregulation 
between the pairs (Table S2 in Additional file 4). As they 
differ by gender, we first took the step of removing 
expression differences due to gender. This was done by 
first fitting a linear model contrasting the genders for all 
genes and using the resulting matrix of residuals from 
the model fit as the expression matrix. We then reran 
the OD and Zscore methods on the matrix of residuals 
and compared the ranks of the top 20 of each sample 
using both methods. For the OD method, the TMPRSS15 
gene was shared between the top 10 genes in the two 
patient samples ranked second and third for 09206 and 
08419 respectively. By comparison, the Zscore method 
appUed to both samples did not share any genes in the 
top 10, in fact the lowest ranked gene for 08419 that was 
in the top 10 of 09206 ranked 10,424^^. This demonstrates 
that the OD method can, in addition to finding the diver- 
gently expressed genes for a single sample, identify and 
prioritize those genes with shared dysregulation between 
samples with similar functional or clinical phenotypes. 

Conclusions 

We have addressed the motivating problem of how to 
detect patient-specific expression dysregulation events, 
as well as providing guidelines and considerations for these 
types of analyses. Our emphasis was on the situation where 
one sample was an outlier relative to the rest and on small 
to moderate cohort sizes, which was representative of our 
cohort of patient samples also analyzed using an siRNA 
sensitivity screen. We benchmarked several methods, 
Zscore and Rscore as well as several variants of the OD 
method, under a variety of conditions including different 
effect sizes and the introduction of technical noise. We 
determined that the OD method performed equally 
well or better than the others in the majority of our 
simulations in terms of power and FDR. The weighted 
variants of the OD methods had greater performance 
when a large amount of technical noise was introduced 
into the simulations. 

When these methods were applied to a set of 12 expres- 
sion arrays from acute B lymphoblastic leukemia samples, 
we showed that the OD method ranked the majority of 
high effect size genes higher or equivalent to Zscore or 
Rscore. Focusing on the Zscore and OD comparison, we 
found that the Zscore ranked higher those genes that 
had low sample-sample variation outside of a single outlier. 



whereas the OD method was more tolerant of sample- 
sample variation depending on the choice of k It was 
further shown that the results of an OD run with k=l 
were more similar to Zscore than OD runs with higher 
k values. When examining the expression data in the 
context of the siRNA hits, we noted that the pattern of 
hits derived from the siRNA screen could either be unique 
to a cohort or be similar among multiple members. This 
implies that related gene expression outliers should either 
be unique or shared. The OD was able to robustly 
prioritize such unique and shared genes whereas the 
Zscore was only effective at finding the former. We 
note that there are other similar contexts in which these 
methods may be successfully applied outside of finding 
genes related to siRNA screens. For instance, one could 
find genes related to adverse clinical outcomes that affect 
only one or two subjects in a given small to medium sized 
cohort. 

Here, we focused on the detection of genes containing 
sample expression values that were up-regulated relative 
to the remaining samples. The OD method can also be 
applied for the detection of down-regulated genes, by 
determining the sign of the difference from the sample 
in question and the mean or median of the samples for a 
given gene. 

One of the difficulties of focusing on the detection of 
outliers for a given set of samples is that it is much more 
difficult to control for potential confounders, because any 
number of technical or biological factors can impact a 
given sample in a high throughput expression experiment. 
One way to address known confounders would be the 
application of these methods to the residuals from a least 
squares fit or robust alternative, as we demonstrated 
through the correction of gender effects. Protecting against 
unknown confounders as in the surrogate variable analysis 
method [42] would seem a natural extension to this idea 
though further research would be necessary. 

For our simulations, we assumed that the overall dis- 
tributions between the samples were highly similar. This 
assumption is likely to be valid for Affymetrix arrays when 
RMA [51] preprocessing and summarization is applied 
due to the default use of quantile normalization [63]. 
Because RMA requires the arrays to be preprocessed 
together, it is desirable to have the expression distributions 
as comparable as possible to ensure the expression esti- 
mates are accurate. As the B-ALL dataset described here 
was processed in a single batch and each sample analyzed 
relative to other members of the batch, the RMA procedure 
was utilized. If multiple batches or even single arrays are 
analyzed together, a variant of RMA, frozen RMA [64], is 
an alternative. 

This work represents a step towards the analysis of 
patient samples in a personalized or precision medicine 
context. We found that the OD method was more efficient 



Bottomly et at. Genome Medicine 2013, 5:103 
http://genonnennedicine.conn/content/5/1 1/103 



Page 9 of 10 



at the task of prioritizing gene expression outliers than 
other alternatives. Also, by being able to take into account 
overall sample dissimilarities, it is better suited to address 
the issues inherent in such a clinical paradigm where 
analysis should not ideally wait for sufficient sample 
accrual before processing and analysis. The OD method 
provides the user with the ability to potentially detect gene 
expression dysregulation events shared between several 
samples. It can be used in relatively small cohorts and has 
high power in that scenario to detect outlier samples if 
there is a high effect size and relatively little sample-sample 
variability. We note that these requirements appear to 
be satisfied in the dataset examined here. Because of 
this, the OD can perform well in many situations and 
provides a robust analytical approach for the detection 
of patient-specific events. 

Additional files 

The following additional data are available with the online 
version of this paper. 
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